Tokenizing

By tokenizing, you can conveniently split up text by word or by sentence.
Tokenizing by word: Words are like the atoms of natural language.

  • (Word-count results) For example, if you were analyzing a group of job ads, you might find that the word “Python” comes up often. That could suggest high demand for Python knowledge, but you’d need to look deeper to know more.
    -> word_tokenize(text: str) -> list[str]
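
A minimal sketch of word tokenization (the sample sentence is my own; the tokenizer data may need a one-time nltk.download("punkt")):

from nltk.tokenize import word_tokenize

word_tokenize("Python developers are in high demand.")
# expected: ['Python', 'developers', 'are', 'in', 'high', 'demand', '.']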

Tokenizing by sentence (context-based results): you can analyze how the words relate to one another and see more context.

  • Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python?
  • Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?
    -> sent_tokenize(text: str) -> list[str]
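
And the sentence-level version (again a made-up example string):

from nltk.tokenize import sent_tokenize

sent_tokenize("The hiring manager hates Python. The ad is about snakes.")
# expected: ['The hiring manager hates Python.', 'The ad is about snakes.']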

Filtering

from nltk.tokenize import sent_tokenize, word_tokenize  # tokenization
from nltk.corpus import stopwords  # the stop word lists may need a one-time nltk.download("stopwords")
stop_words: list = stopwords.words("english")

Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. They’re called stop words because they don’t add a lot of meaning to a text in and of themselves.
Examples: 'in', 'is', and 'an'

stopwords.words('english') includes only lowercase versions of stop words

tokens: list = word_tokenize("Hello there beautiful")  # word-level tokens, so stop word filtering makes sense
stop_words: list = stopwords.words("english")
# ^-- only lowercase versions of stop words...

filtered_list = [  # ...so compare the tokens in lowercase too, via `.lower()` or `.casefold()`
    # (casefold() is like lower(), just more aggressive about non-ASCII characters)
    token for token in tokens if token.casefold() not in stop_words
]
# expected: ['Hello', 'beautiful'] -- 'there' is in the stop word list

Stemming: reduce words to their root

NLTK has more than one stemmer, but you’ll be using the Porter stemmer.
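
A minimal sketch of the Porter stemmer (the example words are my own):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
[stemmer.stem(w) for w in ["connecting", "connected", "connection"]]
# expected: ['connect', 'connect', 'connect']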

Lemmatization is better: instead of just chopping off endings like a stemmer does, it reduces a word to its dictionary form (its lemma), so the result is still a real word.
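
A quick comparison sketch ("scarves" is my own example; the lemmatizer may need a one-time nltk.download("wordnet")):

from nltk.stem import PorterStemmer, WordNetLemmatizer

PorterStemmer().stem("scarves")           # expected: 'scarv' -- not a real word
WordNetLemmatizer().lemmatize("scarves")  # expected: 'scarf' -- an actual dictionary form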

Preparation: Tagging Parts of Speech

Basically, tagging each word with its part of speech first makes lemmatization more accurate.
EXAMPLE:

lemmatizer.lemmatize("worst") # worst, assumed worst was a noun
lemmatizer.lemmatize("worst", pos="a") # bad, originally it is an adjective in the context of the sentence 

CODE SNIPPET #2:

import nltk
from nltk.tokenize import word_tokenize
# first run may also need nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

tokens = word_tokenize("Let's stop learning technology and just start planting. Food will always be in demand")
tagged_tokens: list[tuple] = nltk.pos_tag(tokens)  # list of (word, Penn Treebank tag) pairs

Do the lemmatization now

This continues from CODE SNIPPET #2 above (it reuses the tagged tokens).

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()



# Function to map POS tags from Penn Treebank to WordNet
def penn_to_wordnet(tag):
    # side note: I recommend removing stop words first, before tagging and lemmatizing
    if tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return nltk.corpus.wordnet.NOUN # Default

# Lemmatize the sentence using the mapped POS tags
lemmatized_words = [
    lemmatizer.lemmatize(word=token, pos=penn_to_wordnet(tag)) for token, tag in tagged_tokens
]
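
To see the result (treat the exact output as a sketch, since it depends on how the tagger labels each word):

print(lemmatized_words)
# if the tagger marks "learning" and "planting" as verbs, they come back as 'learn' and 'plant';
# nouns like "technology" and "demand" stay as they are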